Parallel Programming/Lesson 3
- Types of parallel communication patterns
- Stencil - tasks read inputs from a fixed neighborhood in an array
- 2D von Neumann stencil: the cross-shaped neighborhood (center plus up/down/left/right) – see the sketch after this list
- 2D Moore stencil: all surrounding neighbors, including diagonals
- Transpose – tasks reorder data elements in memory (e.g., converting AoS to SoA)
- Array of Structures = AoS
- Structure of Arrays = SoA
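- A minimal sketch of a 2D von Neumann (cross-shaped) stencil kernel; the 5-point averaging and the names `in`, `out`, `width`, `height` are illustrative, not from the lecture:

```cuda
// Each thread averages its element with its four edge neighbors (von Neumann cross).
__global__ void vonNeumannStencil(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Skip the border so every neighbor read stays in bounds.
    if (x < 1 || x >= width - 1 || y < 1 || y >= height - 1) return;

    int idx = y * width + x;
    out[idx] = 0.2f * (in[idx]
                     + in[idx - 1]        // left
                     + in[idx + 1]        // right
                     + in[idx - width]    // up
                     + in[idx + width]);  // down
}
```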
- Efficiency
- Threads
- thread blocks – groups of threads that cooperate to solve a subproblem
- block (and grid) sizes can be picked per kernel launch
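- A sketch of picking a block size at launch time, assuming a simple element-wise kernel (`squareKernel` and the element count are made up for illustration); 256 threads per block is a common starting point, and the grid is rounded up to cover every element:

```cuda
// Hypothetical kernel: each thread handles one element.
__global__ void squareKernel(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= d_data[i];
}

int main()
{
    int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Block size is chosen per kernel launch; round the grid up so all n elements are covered.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```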
- Blocks
- Streaming multiprocessors: SMs
- small GPUs may have 1 SM, large ones 16 or more
- each SM contains many simple processors and some memory
- GPU allocates blocks to SMs
- All SMs run in parallel and independently
- No guarantees from CUDA about when or where a block will run
- the hardware can run blocks whenever (and on whichever SM) it wants
- all blocks must complete before the kernel finishes
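- A small sketch that makes the "no guarantees" point concrete: each block prints its index, and the printed order can change from run to run because the hardware schedules blocks however it likes (kernel name and block count are illustrative):

```cuda
#include <cstdio>

// Each block prints its index; the output order is not guaranteed.
__global__ void helloFromBlock()
{
    printf("hello from block %d\n", blockIdx.x);
}

int main()
{
    helloFromBlock<<<16, 1>>>();   // 16 blocks of 1 thread each
    cudaDeviceSynchronize();       // wait until every block has completed
    return 0;
}
```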
- GPU memory model
- global memory
- shared memory – at block level (shared by the threads of one block)
- host memory
- local memory
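- A sketch of where variables live in this model, assuming a single-block launch of at most 128 threads (names and sizes are illustrative):

```cuda
__device__ float d_globalScale;              // global memory: visible to all threads in all blocks

__global__ void memorySpaces(float* d_buf)   // d_buf points into global memory (from cudaMalloc)
{
    float localVal = d_buf[threadIdx.x];     // local memory: per-thread registers
    __shared__ float tile[128];              // shared memory: one copy per thread block
    tile[threadIdx.x] = localVal;
    __syncthreads();
    d_buf[threadIdx.x] = tile[threadIdx.x] + d_globalScale;
}
// Host memory (malloc/new on the CPU) is separate; data moves to and from
// global memory with cudaMemcpy.
```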
- Synchronization
- barrier – a point where all threads stop and wait until every thread reaches it
- Memory types
- Thread blocks are grouped into kernels; a kernel launch runs a grid of thread blocks
- There's an implicit barrier between kernels
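- A sketch of that implicit ordering between launches on the default stream: the second kernel does not start until every block of the first has finished (the kernels and sizes here are made up):

```cuda
__global__ void stepA(int* d_x) { d_x[threadIdx.x] += 1; }
__global__ void stepB(int* d_x) { d_x[threadIdx.x] *= 2; }

// Launches on the same (default) stream run in order: stepB starts only after
// every block of stepA has completed, so stepA's writes are visible to stepB.
void runSteps(int* d_x)   // d_x: device buffer of at least 64 ints (illustrative)
{
    stepA<<<1, 64>>>(d_x);
    stepB<<<1, 64>>>(d_x);
    cudaDeviceSynchronize();   // also wait before the host touches the result
}
```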
- Computation / memory spaces / synchronization == CUDA
- Maximize arithmetic intensity (math / memory)
- more useful compute (work) per thread
- less time spent on memory access per thread
- speed: local > shared >> global >>> CPU (host) memory
- local = registers and L1 cache
- kernel parameters and local variables live in local memory
- cudaMalloc, cudaMemcpy, cudaMemset
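- A minimal host-side sketch exercising those three calls (sizes and values are arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);             // host memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, bytes);                // global (device) memory
    cudaMemset(d_data, 0, bytes);                      // zero it on the device

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
    printf("h_data[42] = %f\n", h_data[42]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```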
- __shared__ variables are visible to the whole thread block
- __syncthreads() – a barrier for all threads in the block (see the sketch below)
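- A sketch that uses both: reversing a block-sized array through shared memory, where the barrier guarantees every element is staged before any thread reads one back (kernel name and the 256-element limit are illustrative):

```cuda
// Reverse a block-sized array via shared memory.
__global__ void reverseInBlock(float* d_data, int n)   // assumes n == blockDim.x <= 256
{
    __shared__ float tile[256];   // visible to every thread in this block
    int t = threadIdx.x;

    tile[t] = d_data[t];          // each thread stages one element
    __syncthreads();              // barrier: wait for the whole block to finish writing

    d_data[t] = tile[n - 1 - t];  // now it is safe to read another thread's element
}
```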
- Can also coalesce access to global memory
- adjacent threads read/write contiguous memory locations
- "coalesced access"
- atomicAdd, atomicMin, atomicCAS
- Limitations of atomics
- only certain operations (add, subtract, XOR, etc.) – no mod or exponentiation
- only some types are supported (mostly integers)
- atomicCAS can be used to implement essentially any atomic operation
- still no ordering constraints on which thread runs first
- floating-point arithmetic is not associative, so results can differ between runs
- no magic: atomic accesses are serialized, so contention makes them slow
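- A sketch of the usual atomic use case, a histogram with atomicAdd: many threads may hit the same bin, and the atomic serializes those updates so no increment is lost (the kernel and bin scheme are illustrative):

```cuda
__global__ void histogram(const int* d_in, int* d_bins, int n, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = d_in[i] % numBins;   // assumes non-negative input values
        atomicAdd(&d_bins[bin], 1);    // serialized per bin; update order is undefined
    }
}
```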
- Avoid thread divergence
- divergence: threads in the same warp take different branches (see the sketch below)
- if/else where the condition depends on the thread index
- loops with different trip counts per thread
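- A sketch of warp divergence: when threads in one warp take different branches, the hardware runs the two paths one after the other, idling the threads on the other path (the kernel is illustrative):

```cuda
__global__ void divergentBranch(float* d_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        d_data[i] = d_data[i] * 2.0f;   // even lanes run while odd lanes wait...
    } else {
        d_data[i] = d_data[i] + 1.0f;   // ...then odd lanes run while even lanes wait
    }
}
```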